Steve Elston
02/28/2021
Bayesian analysis is a contrast to frequentist methods
The objective of Bayesian analysis is to compute a posterior distribution
This contrasts with frequentist statistics, whose objective is to compute a point estimate and confidence interval from a sample
Bayesian models allow expressing prior information as a prior distribution
The posterior distribution is said to quantify our current belief
Inference can be performed on the posterior distribution by finding the maximum a posteriori (MAP) value and a credible interval
Predictions are made by simulating from the posterior distribution
Two functions must be defined to compute the posterior distribution
Several solution methods; we apply two possibilities
Use an analytical solution; posterior in family of conjugate prior
Grid sample over range of model parameter values
In both cases, goal is to compute the posterior distribution over the plausible range of parameters
What are the properties of the Beta distribution?
Beta distribution for different parameter values
Consider the product of a Binomial likelihood and a Beta prior
\[\begin{align} posterior(\theta | z, n) &= \frac{likelihood(z,n | \theta)\ prior(\theta)}{data\ distribution(z,n)} \\ Beta(\theta | z, n, a, b) &= \frac{Binomial(z,n | \theta)\ Beta(\theta\ |\ a, b)}{p(z,n)} \\ &= Beta(\theta\ |\ z + a -1,\ n-z+b-1) \end{align}\]
Consider example with:
- Prior pseudo counts \([1,9]\), successes \(a = 1 + 1\) and failures, \(b = 9 + 1\)
- Evidence, successes \(= 10\) and failures, \(= 30\)
- Posterior is \(Beta(10 + 2 -1,\ 40 - 10 + 10 -1) = Beta(11,\ 39)\)
Prior, likelihood and posterior for distracted driving
What are the 95% credible intervals for \(Beta(11,\ 39)\)?
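The credible interval can be checked numerically; this sketch uses `scipy.stats`, which is my illustration rather than part of the original:

```python
from scipy.stats import beta

# Posterior from the distracted-driving example
posterior = beta(11, 39)

# Equal-tailed 95% credible interval from the posterior quantiles
lower, upper = posterior.ppf(0.025), posterior.ppf(0.975)
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
print(f"Posterior mean: {posterior.mean():.3f}")  # 11 / (11 + 39) = 0.22
```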
Posterior density of the probability of distracted drivers
Simulate the probabilities of distracted drivers for the next 10 cars using the posterior, \(Beta(11,\ 39)\)
Probability of distracted drivers for the next 10 cars
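A posterior predictive simulation can be sketched as follows; the sample size and random seed are illustrative choices:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(42)

# Draw plausible probabilities from the posterior Beta(11, 39), then
# simulate distracted-driver counts among the next 10 cars
theta = beta(11, 39).rvs(size=10000, random_state=rng)
counts = rng.binomial(n=10, p=theta)

# The mean simulated count should be close to 10 * 0.22 = 2.2
print(f"Mean simulated count: {counts.mean():.2f}")
```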
How can we extend Bayes models to more complex problems?
For simple problems we can use a conjugate prior and posterior
For complex problems, the posterior distribution is unlikely to be from a simple conjugate family
Need to perform sampling to compute approximation of complex posterior
We need highly efficient sampling methods for complex problems
Real-world Bayes models have large numbers of parameters, into the millions
Naive approach is simple grid sampling
Grid sampling does not scale
Consider this thought experiment, sampling each dimension 100 times:
Computational complexity of grid sampling has exponential scaling with dimensionality
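The thought experiment in code: with 100 sample points per dimension, a grid over \(d\) dimensions requires \(100^d\) evaluations:

```python
# Grid sampling cost: points per dimension raised to the number of dimensions
points_per_dim = 100

for n_dims in (1, 2, 3, 6, 10):
    n_grid_points = points_per_dim ** n_dims
    print(f"{n_dims:2d} dimensions: {n_grid_points:.1e} grid points")
```

Even at 10 parameters the grid is astronomically large, long before the millions of parameters mentioned above.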
Need a better approach!
Large-scale Bayesian models need highly efficient sampling methods
Markov chain Monte Carlo (MCMC) sampling is efficient and scalable
Rather than sampling on a grid MCMC methods sample distributions randomly
Requires effort to understand how the algorithm works and what to do when things go wrong
MCMC sampling uses a chain of Markov sampling processes
Since the Markov transition process depends only on the current state, a Markov process is memoryless
\[P(X_{t + 1} | X_t = x_t, X_{t-1} = x_{t-1}, \ldots, X_0 = x_0) = P(X_{t + 1} | X_t = x_t)\]
For system with \(N\) possible states we can write the transition probability matrix, \(\Pi\):
\[\Pi = \begin{bmatrix} \pi_{1,1} & \pi_{1,2} & \cdots & \pi_{1, N}\\ \pi_{2,1} & \pi_{2,2} & \cdots & \pi_{2,N}\\ \vdots & \vdots & \ddots & \vdots \\ \pi_{N,1} & \pi_{N,2} & \cdots & \pi_{N,N} \end{bmatrix}\\ where\\ \pi_{i,j} = probability\ of\ transition\ from\ state\ j\ to\ state\ i\\ and\\ \pi_{i,i} = probability\ of\ staying\ in\ state\ i\\ further\\ \pi_{i,j} \ne \pi_{j,i}\ in\ general \]
To make the foregoing more concrete let’s construct a simple example. We will start with a system of 3 states, \(\{ x_1, x_2, x_3 \}\). The transition matrix is:
\[\Pi = \begin{bmatrix} \pi_{1,1} & \pi_{1,2} & \pi_{1,3}\\ \pi_{2,1} & \pi_{2,2} & \pi_{2,3}\\ \pi_{3,1} & \pi_{3,2} & \pi_{3,3} \end{bmatrix} = \begin{bmatrix} 0.5 & 0.0 & 0.6\\ 0.2 & 0.3 & 0.4\\ 0.3 & 0.7 & 0.0 \end{bmatrix} \]
Let’s apply a probability matrix to a set of possible states
\[\vec{x}_{t+1} = \Pi\ \vec{x}_t = \begin{bmatrix} 0.5 & 0.0 & 0.6\\ 0.2 & 0.3 & 0.4\\ 0.3 & 0.7 & 0.0 \end{bmatrix} \begin{bmatrix} 1\\ 0\\ 0 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.2 \\ 0.3 \end{bmatrix} \]
The foregoing is a single step of a Markov process
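This single step can be reproduced with a matrix-vector product in NumPy:

```python
import numpy as np

# Transition matrix from the example; each column sums to 1
Pi = np.array([[0.5, 0.0, 0.6],
               [0.2, 0.3, 0.4],
               [0.3, 0.7, 0.0]])

# Start with certainty of being in state 1
x_t = np.array([1.0, 0.0, 0.0])

# One step of the Markov process
x_next = Pi @ x_t
print(x_next)  # [0.5 0.2 0.3]
```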
What happens when there is a series of transitions?
A sequence of such transitions is known as a Markov chain
There are two major behaviors observed with Markov Chains
Episodic Markov chains have a terminal state
Continuous Markov chains have no terminal state
Markov chain comprises a number of state transitions
Chain of \(n\) state transitions, \(\{t_1, t_2, t_3, \ldots, t_n \}\)
Each transition has the probabilities given by the state transition matrix, \(\Pi\)
To estimate the probabilities of being in the states we use a special case known as a stationary Markov chain
Start with initial state, \(\vec{x}_0\) for a continuous Markov chain:
\[\Pi\ \Pi\ \Pi\ \ldots \Pi\ \vec{x}_t = \Pi^n\ \vec{x}_t \xrightarrow[\text{$n \rightarrow \infty$}]{} \vec{p(x)}\]
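A power-iteration sketch of this limit, reusing the 3-state \(\Pi\) from the earlier example:

```python
import numpy as np

Pi = np.array([[0.5, 0.0, 0.6],
               [0.2, 0.3, 0.4],
               [0.3, 0.7, 0.0]])

x = np.array([1.0, 0.0, 0.0])  # initial state vector

# Repeatedly apply the transition matrix; the state vector
# converges to the stationary distribution p(x)
for _ in range(100):
    x = Pi @ x

print(np.round(x, 3))
```

At convergence the distribution is unchanged by another transition, \(\Pi\ \vec{p} = \vec{p}\).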
We can find the probabilities of the states without knowing the values of the transition matrix, \(\Pi\)!
As long as we can repeatedly sample the stochastic Markov process, we can estimate the state probabilities
This is the key to Markov Chain Monte Carlo sampling
The first MCMC sampling algorithm developed was the Metropolis-Hastings algorithm, often referred to simply as the Metropolis or M-H algorithm.
\[Acceptance\ probability = min\left(1,\ \frac{p(data\ |\ new\ parameters)}{p(data\ |\ previous\ parameters)}\right)\].
If the sample is accepted, compute the posterior density at the new sample point
Repeat sampling steps many times, until convergence
Eventually, the M-H algorithms converges to the posterior distribution
M-H random sampling algorithm is far more sample efficient than naive grid sampling
Consider that the M-H algorithm probabilistically samples the parameter space
Important properties of the Metropolis-Hastings MCMC algorithm include:
Poor convergence arises from low sample efficiency
Algorithm must be ‘tuned’ to ensure sample efficiency
Tuning finds a good dispersion parameter value for the state sampling distribution
Dispersion parameter determines the size of the jumps the algorithm makes
For example, for a Normal proposal distribution, pick the variance \(\sigma^2\)
Let’s try a simple example, find an estimate of the posterior density of a bivariate Normal distribution
\[\mu = \begin{bmatrix} 0.5\\ 0.5 \end{bmatrix}\]
\[Covariance = \begin{bmatrix} 1.0 & 0.6\\ 0.6 & 1.0 \end{bmatrix}\]
Random samples from a bivariate Normal distribution
And, the marginal distributions of the variables
Marginal distributions of bivariate Normal samples
Now, we are ready to sample these data using the M-H MCMC algorithm
The algorithm is
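A minimal random-walk Metropolis sketch for this bivariate Normal target; the step size, sample count, and starting point are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(123)

# Target: the bivariate Normal defined above
target = multivariate_normal(mean=[0.5, 0.5],
                             cov=[[1.0, 0.6], [0.6, 1.0]])

def metropolis(n_samples, step_sigma=0.5):
    samples = np.empty((n_samples, 2))
    x = np.zeros(2)          # arbitrary starting point
    n_accepted = 0
    for i in range(n_samples):
        # Propose a jump from a symmetric Normal distribution
        proposal = x + rng.normal(scale=step_sigma, size=2)
        # Accept with probability min(1, p(proposal) / p(current))
        if rng.uniform() < target.pdf(proposal) / target.pdf(x):
            x = proposal
            n_accepted += 1
        samples[i] = x       # on rejection, repeat the current sample
    return samples, n_accepted / n_samples

samples, acceptance_rate = metropolis(5000)
print(f"acceptance rate = {acceptance_rate:.2f}")
print(f"sample mean (after burn-in) = {samples[500:].mean(axis=0)}")
```

The `step_sigma` argument is the dispersion parameter that must be tuned; too small and the chain explores slowly, too large and most proposals are rejected.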
MCMC samples from bivariate Normal distribution
Plot the first 500 samples
First 500 MCMC samples from bivariate Normal distribution
Notice the long ‘tail’ on the sampled distribution from the burn-in period.
Marginal distributions of the MCMC samples, less first 500
Marginal distributions of MCMC samples, less the first 500
These marginals are similar to the original distribution
How can we understand the convergence properties of the M-H MCMC sampler?
MCMC sampling generally converges to the underlying distribution, but convergence can be slow
In some pathological cases, convergence may not occur at all
The acceptance rate and rejection rate are key convergence statistics for the M-H algorithm
For our example these statistics are fairly good:
\[Acceptance\ rate = 0.81\] \[Rejection\ rate = 0.19\]
Trace plot of the samples displays the sample value of the parameter with sample number
Trace plots for two variables from the M-H algorithm
The traces show sampling around the highest density values of the parameters, indicating good convergence
An autocorrelation plot shows how a sample value is related to the previous samples
Autocorrelation of Markov chain for the two parameters
Notice that the autocorrelation dies off fairly quickly with lag
Low autocorrelation indicates good sampling efficiency
We can relate sampling efficiency to the autocorrelation of the samples
\[ESS = \frac{N}{1 + 2 \sum_k ACF(k)}\]
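A simplified ESS estimate can be sketched as follows; the truncation rule (stop summing once the autocorrelation is negligible) and the AR(1) test chain are my assumptions for illustration:

```python
import numpy as np

def effective_sample_size(x, max_lag=200):
    """Simplified ESS: N / (1 + 2 * sum of positive autocorrelations)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    var = x.var()
    acf_sum = 0.0
    for k in range(1, max_lag):
        acf_k = np.dot(x[:-k], x[k:]) / ((n - k) * var)
        if acf_k < 0.05:      # truncate once correlation is negligible
            break
        acf_sum += acf_k
    return n / (1.0 + 2.0 * acf_sum)

rng = np.random.default_rng(0)

# AR(1) chain with correlation 0.9: strongly autocorrelated samples
ar = np.zeros(10000)
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

print(f"ESS of correlated chain: {effective_sample_size(ar):.0f} of {len(ar)}")
```

The correlated chain yields far fewer effective samples than its nominal length, which is exactly what the ESS formula quantifies.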
A number of other powerful MCMC sampling algorithms have been developed
Gibbs sampler is an improved MCMC sampler which speeds convergence
Named for the 19th Century physicist Josiah Willard Gibbs; inspired by statistical mechanics
Gibbs sampler samples each dimension of the parameter space sequentially in a round-robin manner
M-H algorithm attempts jumps across all dimensions of the parameter space.
Compared to the M-H, Gibbs sampling reduces serial correlation through round-robin sampling
Update along each dimension approximately orthogonal to the preceding sampled dimension
There are no tuning parameters since sampling is based on the full conditional distributions of the parameters.
The basic Gibbs sampler algorithm has the following steps:
For an N dimensional parameter space, \(\{ \theta_1, \theta_2, \ldots, \theta_N \}\), find a random starting point
In order, \(\{1, 2, 3, \ldots, N\}\), assign the next dimension to sample, starting with dimension \(1\); actual order not important
Sample the conditional distribution of the parameter given the observations, \(D\), and the other parameter values: \(p(\theta_1|D, \theta_2, \theta_3, \ldots, \theta_N)\)
Repeat steps 2 and 3 until convergence
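The steps above can be sketched for a bivariate Normal target (means 0.5, correlation 0.6, unit variances; the example reuses the earlier M-H target and is my illustration), where each full conditional is itself Normal:

```python
import numpy as np

rng = np.random.default_rng(7)

mu, rho = 0.5, 0.6   # means and correlation of the bivariate Normal
n_samples = 10000
samples = np.empty((n_samples, 2))
x1, x2 = 0.0, 0.0    # step 1: arbitrary starting point
cond_sd = np.sqrt(1.0 - rho ** 2)  # sd of each conditional

for i in range(n_samples):
    # Steps 2-3: sample each parameter from its full conditional in turn,
    # using the most recent value of the other parameter
    x1 = rng.normal(mu + rho * (x2 - mu), cond_sd)
    x2 = rng.normal(mu + rho * (x1 - mu), cond_sd)
    samples[i] = (x1, x2)

print(samples.mean(axis=0))          # close to [0.5, 0.5]
print(np.corrcoef(samples.T)[0, 1])  # close to 0.6
```

Note there is no proposal distribution to tune and no rejected samples, in contrast to M-H.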
The Hamiltonian sampler was proposed in the late 1980s and uses a simple idea from classical mechanics
NUTS represents the state of the art in MCMC samplers
PyMC3 package uses the NUTS MCMC algorithm.
NUTS improves on HMC
The simulated trajectory (the 'ball') is stopped before it turns around and returns toward its starting point (the 'U-turn')
Why even discuss other samplers when we have NUTS?
So far, we have only worked with models having a simple flat parameter structure
For complex models fully employing the power of Bayesian methods, we must use hierarchical models
Hierarchical models allow us to compute posterior distributions for models with multiple, dependent parameters
Build a hierarchy of parameter dependencies by factoring the joint distribution
Use the chain rule of probability to factor a joint distribution into hierarchy
\[P(A,B) = P(A|B)P(B)\]
\[P(A_1, A_2, A_3, A_4 \ldots, A_n) = P(A_1 | A_2, A_3, A_4, \ldots, A_n)\ P(A_2, A_3, A_4 \ldots, A_n)\]
\[P(A_1, A_2, A_3, A_4 \ldots, A_n) =\\ P(A_1 | A_2, A_3, A_4, \ldots, A_n)\ P(A_2 | A_3, A_4 \ldots, A_n)\\ P(A_3| A_4 \ldots, A_n) \ldots P(A_n)\]
The factorization is not unique.
Factor the variables in any order
For a joint distribution with \(n\) variables, there are \(n!\) unique factorizations
Example, we can factorize the foregoing distribution as:
\[P(A_1, A_2, A_3, A_4 \ldots, A_n) =\\ P(A_n | A_{n-1}, A_{n-2}, A_{n-3}, \ldots, A_1)\ P(A_{n-1}| A_{n-2}, A_{n-3}, \ldots, A_1)\\ P(A_{n-2}| A_{n-3}, \ldots, A_1) \ldots P(A_1)\]
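The chain rule factorization can be verified numerically on a small discrete joint distribution; the 2 x 3 table here is an arbitrary illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary 2 x 3 joint distribution P(A, B)
joint = rng.random((2, 3))
joint /= joint.sum()

# Factor one way: P(A, B) = P(A | B) P(B)
p_b = joint.sum(axis=0)                 # marginal P(B)
p_a_given_b = joint / p_b               # conditional P(A | B)

# Factor the other way: P(A, B) = P(B | A) P(A)
p_a = joint.sum(axis=1, keepdims=True)  # marginal P(A)
p_b_given_a = joint / p_a               # conditional P(B | A)

# Both orderings reproduce the same joint distribution
print(np.allclose(p_a_given_b * p_b, joint),
      np.allclose(p_b_given_a * p_a, joint))  # True True
```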
Apply the chain rule of probability to Bayes theorem
Allows us to create hierarchical Bayesian models
Consider an example of a multi-parameter model; estimating both the mean, \(\mu\), and standard deviation, \(\sigma\), of a Normal distribution
\[p(\mu, \sigma | D) \propto p(D| \mu, \sigma) p(\mu, \sigma)\]
\[\begin{align} p(\mu, \sigma | D) &\propto p(D | \mu, \sigma)\ p(\mu | \sigma)\ p(\sigma)\\ &\propto\ Likelihood\ *\ Conditional\ Distribution\ of\ \mu\ given\ \sigma\ *\ Prior\ of\ \sigma \end{align}\]
Use the Gamma distribution for the prior of the standard deviation
Density of the Gamma distribution with scale = 2.0
We are now ready to create a hierarchical Bayes model
Hierarchical model for the posterior distribution
In mathematical terms we can define the hierarchical model as follows:
Prior of the scale, \(\sigma\), is \(Gamma(2, 2)\); fairly noninformative
Prior of the mean, \(\mu\), is \(N(0, 3)\), fairly dispersed and noninformative
The posterior distribution uses the priors for \(\mu\) and \(\sigma\) and a Normal likelihood
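For intuition, this two-parameter posterior can also be evaluated on a grid; the synthetic observations, grid ranges, and resolution below are my assumptions for illustration:

```python
import numpy as np
from scipy.stats import norm, gamma

rng = np.random.default_rng(2)

# Synthetic observations (assumed for illustration)
y = rng.normal(loc=1.0, scale=2.0, size=50)

# Grid over the two parameters
mu_grid = np.linspace(-3, 3, 200)
sigma_grid = np.linspace(0.1, 5, 200)
mu, sigma = np.meshgrid(mu_grid, sigma_grid, indexing="ij")

# log posterior ∝ log likelihood + log prior(mu) + log prior(sigma)
# Priors match the model above: mu ~ N(0, 3), sigma ~ Gamma(2, rate=2)
log_post = (norm.logpdf(y[:, None, None], loc=mu, scale=sigma).sum(axis=0)
            + norm.logpdf(mu, loc=0.0, scale=3.0)
            + gamma.logpdf(sigma, a=2.0, scale=1.0 / 2.0))

# MAP estimate from the grid
i, j = np.unravel_index(np.argmax(log_post), log_post.shape)
print(f"MAP: mu = {mu_grid[i]:.2f}, sigma = {sigma_grid[j]:.2f}")
```

With 50 observations the likelihood dominates the fairly noninformative priors, so the MAP lands near the sample mean and standard deviation.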
Using PyMC3 to sample the posterior distribution
import pymc3

def create_model(y):
    ## Define a PyMC3 model object
    model = pymc3.Model()
    with model:
        # Define the prior distributions for the two model parameters, location, mu, and scale
        scale = pymc3.Gamma('scale', alpha=2, beta=2)
        mu = pymc3.Normal('mu', mu=0, sigma=3.0)
        # The model for the posterior distribution uses the priors of the location and scale
        # to model the observations
        y_obs = pymc3.Normal('y_obs', mu=mu, sigma=scale, observed=y)
    return model
Posterior distribution and trace plots for two model parameters
PyMC3 provides quite a number of model evaluation statistics for 4 chains X 1,000 samples each
The summary shows a lot of useful information:
Summary of MCMC model performance
Parameter means and statistics
mcse_mean is an estimate of the error arising from the MCMC sampling itself
ESS statistics
Gelman-Rubin statistic (Rhat) measures the ratio of the variance between chains to the variance within each chain; values near 1 indicate convergence
Credible intervals from MCMC sampling of model parameters
For complex Bayesian models we need a computationally efficient approximation
Grid sampling is inefficient
MCMC sampling is based on Markov chains
Several MCMC sampling methods have been developed
NUTS is the state of the art MCMC algorithm
Hierarchical models can be used to represent complex relationships
Hierarchical model based on factorization of joint distribution
Model parameters (hyperparameters) each have a prior
The model is sampled with MCMC algorithm
Performance metrics of MCMC sampling